approximate policy iteration
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
Finite-time Analysis of Approximate Policy Iteration for the Linear Quadratic Regulator
We study the sample complexity of approximate policy iteration (PI) for the Linear Quadratic Regulator (LQR), building on a recent line of work using LQR as a testbed to understand the limits of reinforcement learning (RL) algorithms on continuous control tasks. Our analysis quantifies the tension between policy improvement and policy evaluation, and suggests that policy evaluation is the dominant factor in terms of sample complexity. Specifically, we show that to obtain a controller that is within $\varepsilon$ of the optimal LQR controller, each step of policy evaluation requires at most $(n+d)^3/\varepsilon^2$ samples, where $n$ is the dimension of the state vector and $d$ is the dimension of the input vector. On the other hand, only $\log(1/\varepsilon)$ policy improvement steps suffice, resulting in an overall sample complexity of $(n+d)^3 \varepsilon^{-2} \log(1/\varepsilon)$. We furthermore build on our analysis and construct a simple adaptive procedure based on $\varepsilon$-greedy exploration which relies on approximate PI as a sub-routine and obtains $T^{2/3}$ regret, improving upon a recent result of Abbasi-Yadkori et al. 2019.
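To make the policy-evaluation/policy-improvement split above concrete, here is a minimal sketch of exact (model-based) policy iteration for LQR, the recursion whose evaluation step the paper replaces with a sample-based least-squares estimate. The toy system matrices, the Lyapunov-equation solve, and the initial stabilizing gain are illustrative assumptions, not the paper's estimator.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, iters=20):
    """Exact policy iteration for LQR with control u_t = K x_t.

    Policy evaluation solves a discrete Lyapunov equation for the value
    matrix P_K of the current controller; policy improvement is the greedy
    LQR update. (The paper's approximate variant estimates the evaluation
    step from samples; this sketch uses the known model.)
    """
    K = K0
    for _ in range(iters):
        A_cl = A + B @ K
        # Evaluation: P = (Q + K^T R K) + A_cl^T P A_cl
        P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
        # Improvement: greedy controller w.r.t. the evaluated value matrix
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Toy 2-state / 1-input system with a stabilizing initial gain (illustrative only).
A = np.array([[1.01, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = lqr_policy_iteration(A, B, Q, R, K0=np.array([[-0.5, -1.0]]))
print(K)
```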
Deep reinforcement learning with symmetric data augmentation applied for aircraft lateral attitude tracking control
Li, Yifei, van Kampen, Erik-jan
Symmetry is an essential property of some dynamical systems that can be exploited for state-transition prediction and control-policy optimization. This paper develops two symmetry-integrated Reinforcement Learning (RL) algorithms based on standard Deep Deterministic Policy Gradient (DDPG), which leverage environment symmetry to augment the explored transition samples of a Markov Decision Process (MDP). The first algorithm, Deep Deterministic Policy Gradient with Symmetric Data Augmentation (DDPG-SDA), enriches the dataset of the standard DDPG algorithm through symmetric data augmentation under a symmetry assumption on the dynamical system. To further improve sample-utilization efficiency, the second algorithm incorporates an extra critic network that is trained independently on the augmented dataset, and a two-step approximate policy iteration method is proposed to integrate the training of the two critic networks and the actor network; the resulting algorithm is named Deep Deterministic Policy Gradient with Symmetric Critic Augmentation (DDPG-SCA). Simulation results demonstrate the improved sample efficiency and tracking performance of both algorithms on an aircraft lateral-attitude tracking control task. I. INTRODUCTION. Symmetry is common in the motions of various mechanical systems, such as aircraft [1], cars [2], and robotic arms [3]. The usual understanding of this property is that trajectories are symmetric with respect to a reference plane; more specifically, the trajectory on one side of the plane can be predicted from the trajectory on the other side.
- Europe > Netherlands > South Holland > Delft (0.04)
- North America > United States > Massachusetts (0.04)
- North America > United States > Florida > Broward County > Hollywood (0.04)
- (3 more...)
- Transportation (0.47)
- Aerospace & Defense (0.46)
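A minimal sketch of the symmetric data augmentation idea in the DDPG-SDA entry above: each explored transition is reflected through an assumed symmetry of the dynamics before being stored, doubling the usable samples. The reflection matrices S_STATE and S_ACTION below are hypothetical placeholders; for lateral aircraft dynamics they would be derived from the vehicle's known mirror symmetry.

```python
import numpy as np

# Illustrative reflection operators (placeholders): diagonal sign flips on
# the components that mirror across the symmetry plane. For a real vehicle
# model these would come from the known symmetry of the dynamics.
S_STATE = np.diag([1.0, -1.0, -1.0, 1.0])   # e.g. flip the lateral states
S_ACTION = np.diag([-1.0])                   # e.g. flip the lateral surface command

def augment_transition(s, a, r, s_next, done):
    """Return the original transition plus its mirrored copy."""
    mirrored = (S_STATE @ s, S_ACTION @ a, r, S_STATE @ s_next, done)
    return [(s, a, r, s_next, done), mirrored]

replay_buffer = []
s = np.array([0.2, 0.1, -0.05, 1.0])
a = np.array([0.3])
s_next = np.array([0.25, 0.08, -0.04, 1.0])
replay_buffer.extend(augment_transition(s, a, 0.5, s_next, False))
print(len(replay_buffer))  # 2: original + symmetric copy
```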
A reinterpretation of the policy oscillation phenomenon in approximate policy iteration
A majority of approximate dynamic programming approaches to the reinforcement learning problem can be categorized into greedy value function methods and value-based policy gradient methods. The former approach, although fast, is well known to be susceptible to the policy oscillation phenomenon. We take a fresh view of this phenomenon by casting a considerable subset of the former approach as a limiting special case of the latter. We explain the phenomenon in terms of this view and illustrate the underlying mechanism with artificial examples. We also use this view to derive the constrained natural actor-critic algorithm, which can interpolate between the two approaches. In addition, it has been suggested in the literature that the oscillation phenomenon might be subtly connected to the grossly suboptimal performance of all attempted approximate dynamic programming methods on the Tetris benchmark problem. We report empirical evidence against such a connection and in favor of an alternative explanation. Finally, we report scores on the Tetris problem that improve on existing dynamic-programming-based results.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
- Europe > Finland (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Asia > Middle East > Jordan (0.04)
Learning from Limited Demonstrations
Kim, Beomjoon, Farahmand, Amir-massoud (School of Computer Science, McGill University)
We propose a Learning from Demonstration (LfD) algorithm that leverages expert data even when the demonstrations are very few or inaccurate. We achieve this by using both expert data and reinforcement signals gathered through trial-and-error interactions with the environment. The key idea of our approach, Approximate Policy Iteration with Demonstration (APID), is that the expert's suggestions are used to define linear constraints which guide the optimization performed by Approximate Policy Iteration. We prove an upper bound on the Bellman error of the estimate computed by APID at each iteration. Moreover, we show empirically that APID outperforms pure Approximate Policy Iteration, a state-of-the-art LfD algorithm, and supervised learning in a variety of scenarios, including when very few and/or suboptimal demonstrations are available. Our experiments include simulations as well as a real robot path-finding task.
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
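A hedged sketch of the optimization flavor that APID describes: a Bellman-residual objective over a linear Q-function combined with large-margin terms that keep the expert's demonstrated action above the alternatives. The feature map, margin, penalty weight, and the soft-penalty (rather than hard-constraint) formulation are illustrative assumptions, not the paper's exact program.

```python
import numpy as np

def apid_style_update(w, phi, transitions, expert_pairs, actions,
                      gamma=0.99, margin=1.0, lam=0.1, lr=1e-2, steps=500):
    """Soft-penalty variant of the APID idea for a linear Q(s,a) = w . phi(s,a).

    transitions: (s, a, r, s_next, a_next) tuples for the Bellman residual.
    expert_pairs: (s, a_expert) demonstration pairs enforcing, as hinge terms,
    Q(s, a_expert) >= Q(s, a') + margin for every alternative action a'.
    """
    for _ in range(steps):
        grad = np.zeros_like(w)
        for s, a, r, s_next, a_next in transitions:
            # Semi-gradient of the squared Bellman residual (target held fixed)
            td = w @ phi(s, a) - (r + gamma * (w @ phi(s_next, a_next)))
            grad += td * phi(s, a)
        for s, a_exp in expert_pairs:
            for a_alt in actions:
                if a_alt == a_exp:
                    continue
                # Hinge penalty on any violated expert-margin constraint
                violation = margin + w @ phi(s, a_alt) - w @ phi(s, a_exp)
                if violation > 0:
                    grad += lam * (phi(s, a_alt) - phi(s, a_exp))
        w = w - lr * grad
    return w

# Toy usage with a scalar state and two actions (illustrative values only).
actions = [0, 1]
phi = lambda s, a: np.array([s, s * a, a, 1.0])
transitions = [(0.5, 1, 1.0, 0.6, 1), (0.6, 0, 0.0, 0.4, 0)]
expert_pairs = [(0.5, 1)]
w = apid_style_update(np.zeros(4), phi, transitions, expert_pairs, actions)
print(w)
```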
Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms
Murthy, Yashaswini, Moharrami, Mehrdad, Srikant, R.
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI), i.e., where policy improvement and policy evaluation are both performed approximately. In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor close to $1$, which is equivalent to making the expected horizon very large. However, the corresponding theoretical bounds for error performance scale with the square of the horizon. Thus, even after dividing the total reward by the length of the horizon, the corresponding performance bounds for average reward problems go to infinity. Therefore, an open problem has been to obtain meaningful performance bounds for approximate PI and RL algorithms for the average-reward setting. In this paper, we solve this open problem by obtaining the first finite-time error bounds for average-reward MDPs, and show that the asymptotic error goes to zero in the limit as policy evaluation and policy improvement errors go to zero.
- North America > United States > Illinois > Champaign County > Urbana (0.14)
- North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
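For reference, a minimal tabular sketch of the exact average-reward policy iteration whose approximate counterpart the paper analyzes: evaluation solves a linear system for the gain rho and a bias (differential value) vector h, and improvement is greedy with respect to h. The small random MDP and the unichain normalization h(s0) = 0 are illustrative assumptions.

```python
import numpy as np

def evaluate_average_reward(P, r, pi):
    """Gain rho and bias h of a stationary policy pi on a unichain MDP:
    h(s) + rho = r(s, pi(s)) + sum_s' P(s'|s, pi(s)) h(s'), with h(0) = 0."""
    n = P.shape[-1]
    P_pi = np.array([P[s, pi[s]] for s in range(n)])   # (n, n)
    r_pi = np.array([r[s, pi[s]] for s in range(n)])   # (n,)
    # Unknowns x = (rho, h(0), ..., h(n-1)); fix h(0) = 0 for uniqueness.
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, 0] = 1.0                    # rho column
    A[:n, 1:] = np.eye(n) - P_pi      # (I - P_pi) h
    b[:n] = r_pi
    A[n, 1] = 1.0                     # normalization row: h(0) = 0
    x = np.linalg.solve(A, b)
    return x[0], x[1:]

def average_reward_policy_iteration(P, r, iters=50):
    n = P.shape[0]
    pi = np.zeros(n, dtype=int)
    for _ in range(iters):
        rho, h = evaluate_average_reward(P, r, pi)
        pi_new = np.argmax(r + P @ h, axis=1)          # greedy improvement w.r.t. bias
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new
    return pi, rho

# Toy 3-state, 2-action MDP with random dynamics (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))             # P[s, a, s']
r = rng.uniform(size=(3, 2))
print(average_reward_policy_iteration(P, r))
```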
A Convergent Form of Approximate Policy Iteration
We study a new, model-free form of approximate policy iteration which uses Sarsa updates with linear state-action value function approximation for policy evaluation, and a "policy improvement operator" to generate a new policy based on the learned state-action values. We prove that if the policy improvement operator produces $\epsilon$-soft policies and is Lipschitz continuous in the action values, with a constant that is not too large, then the approximate policy iteration algorithm converges to a unique solution from any initial policy. To our knowledge, this is the first convergence result for any form of approximate policy iteration under similar computational-resource assumptions.
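A brief sketch of the ingredients this result refers to, under assumed parameters: Sarsa(0) updates on a linear state-action value function, with next actions drawn from an $\epsilon$-soft policy improvement operator that is Lipschitz in the action values (here, a softmax mixed with the uniform distribution). The temperature, $\epsilon$, discount, and toy features are illustrative, not values from the paper.

```python
import numpy as np

def epsilon_soft_improvement(q_values, epsilon=0.1, temperature=1.0):
    """Policy improvement operator: softmax over action values mixed with the
    uniform distribution, so every action keeps probability >= epsilon / |A|.
    The softmax is Lipschitz continuous in the action values."""
    z = q_values / temperature
    z -= z.max()                                  # numerical stability
    soft = np.exp(z) / np.exp(z).sum()
    return (1.0 - epsilon) * soft + epsilon / len(q_values)

def sarsa_api_step(w, phi, s, a, r, s_next, actions, alpha=0.05, gamma=0.95):
    """One Sarsa(0) update on the linear value function Q(s,a) = w . phi(s,a),
    with the next action sampled from the epsilon-soft improved policy."""
    q_next = np.array([w @ phi(s_next, b) for b in actions])
    a_next = np.random.choice(actions, p=epsilon_soft_improvement(q_next))
    td = r + gamma * (w @ phi(s_next, a_next)) - w @ phi(s, a)
    return w + alpha * td * phi(s, a), a_next

# Toy usage with a scalar state and two actions (illustrative only).
phi = lambda s, a: np.array([s, a, s * a, 1.0])
w = np.zeros(4)
w, a_next = sarsa_api_step(w, phi, s=0.3, a=1, r=1.0, s_next=0.4, actions=[0, 1])
print(w, a_next)
```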
Approximate Policy Iteration with a Policy Language Bias
We explore approximate policy iteration, replacing the usual cost-function learning step with a learning step in policy space. We give policy-language biases that enable solution of very large relational Markov decision processes (MDPs) that no previous technique can solve. In particular, we induce high-quality domain-specific planners for classical planning domains (both deterministic and stochastic variants) by solving such domains as extremely large MDPs.
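A hedged sketch of what a "learning step in policy space" can look like inside an approximate PI loop: rollouts under the current policy score each action at sampled states, and a classifier (a crude stand-in for the paper's relational policy language) is trained to imitate the rollout-greedy choice. The toy grid domain, the rollout estimator, and the scikit-learn decision tree are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy deterministic grid "planning domain" (illustrative stand-in for a
# relational MDP): states are (x, y) on a 5x5 grid and the goal is (4, 4).
ACTIONS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def step(state, action):
    x = min(max(state[0] + action[0], 0), 4)
    y = min(max(state[1] + action[1], 0), 4)
    return (x, y), 1.0 if (x, y) == (4, 4) else 0.0

def rollout_return(state, action, policy, horizon=15, gamma=0.95):
    """Rollout estimate of Q^pi(s, a) under the current policy."""
    state, reward = step(state, action)
    total, discount = reward, 1.0
    for _ in range(horizon):
        discount *= gamma
        state, reward = step(state, policy(state))
        total += discount * reward
    return total

def policy_space_api(n_iters=5):
    """Approximate PI with the improvement done in policy space: fit a
    classifier to the rollout-greedy action at every sampled state."""
    clf = None
    policy = lambda s: ACTIONS[0] if clf is None else ACTIONS[
        clf.predict([list(s)])[0]]
    for _ in range(n_iters):
        states = [(x, y) for x in range(5) for y in range(5)]
        labels = [int(np.argmax([rollout_return(s, a, policy) for a in ACTIONS]))
                  for s in states]
        clf = DecisionTreeClassifier(max_depth=4).fit(
            [list(s) for s in states], labels)
    return clf

learned = policy_space_api()
print(learned.predict([[0, 0]]))   # index of the action chosen at the start state
```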